Week 3 - Introduction to Natural Language Processing with Deep Learning (WikiDocs)
Convolutional Neural Networks for NLP
1) Convolutional Neural Networks
- A neural network architecture that is efficient for image processing.
- Consists mainly of convolution layers and pooling layers.
1. Why convolutional neural networks emerged
- Image processing used to be done with multilayer perceptrons.
- An image was represented as a matrix (a 2D tensor) and flattened so that each pixel became a separate input value.
- Flattening the image this way cannot preserve its spatial structure information.
- Convolutional neural networks were introduced to address this.
Channels
- Machines process numbers (tensors) more easily than raw images.
- An image is a 3D tensor of shape (height, width, channels).
- Height: the number of pixels along the vertical axis.
- Width: the number of pixels along the horizontal axis.
- Each pixel holds a value from 0 to 255 per channel.
- Channels: the color components (a grayscale image has 1 channel; a typical color image has 3).
Convolution Operation
- In a convolution layer, the convolution operation extracts features from the image.
- The kernel slides over the image sequentially, from the top-left corner to the bottom-right.
Convolution with multiple channels
- Each channel of the kernel must be the same size, and the kernel must have as many channels as the input.
- After the per-channel convolutions are computed, the results are summed into a feature map with a single channel.
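The multi-channel convolution described above can be sketched in plain numpy; the function name and toy tensors here are made up for illustration:

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    """Valid convolution of a (H, W, C) image with a (kh, kw, C) kernel.

    Each kernel channel is convolved with the matching image channel, and the
    per-channel results are summed, so the output feature map has one channel.
    """
    H, W, C = image.shape
    kh, kw, kc = kernel.shape
    assert C == kc, "kernel must have the same number of channels as the image"
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product over all channels, summed to a single value
            out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * kernel)
    return out

image = np.arange(5 * 5 * 3, dtype=float).reshape(5, 5, 3)  # toy 5x5 "RGB" image
kernel = np.ones((3, 3, 3))                                 # toy 3x3 kernel, 3 channels
feature_map = conv2d_multichannel(image, kernel)
print(feature_map.shape)  # (3, 3): one output channel, regardless of input channels
```

Note the output spatial size is (H - kh + 1, W - kw + 1) because the padding is "valid", matching the Conv1D layers used later in these notes.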
2) Classifying IMDB Reviews with a 1D CNN
from tensorflow.keras import datasets
from tensorflow.keras.preprocessing.sequence import pad_sequences
vocab_size = 10000
(X_train, y_train), (X_test, y_test) = datasets.imdb.load_data(num_words = vocab_size)
print(X_train[:5])
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz 17465344/17464789 [==============================] - 0s 0us/step [list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]) list([1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95]) list([1, 14, 47, 8, 30, 31, 7, 4, 249, 108, 7, 4, 5974, 
54, 61, 369, 13, 71, 149, 14, 22, 112, 4, 2401, 311, 12, 16, 3711, 33, 75, 43, 1829, 296, 4, 86, 320, 35, 534, 19, 263, 4821, 1301, 4, 1873, 33, 89, 78, 12, 66, 16, 4, 360, 7, 4, 58, 316, 334, 11, 4, 1716, 43, 645, 662, 8, 257, 85, 1200, 42, 1228, 2578, 83, 68, 3912, 15, 36, 165, 1539, 278, 36, 69, 2, 780, 8, 106, 14, 6905, 1338, 18, 6, 22, 12, 215, 28, 610, 40, 6, 87, 326, 23, 2300, 21, 23, 22, 12, 272, 40, 57, 31, 11, 4, 22, 47, 6, 2307, 51, 9, 170, 23, 595, 116, 595, 1352, 13, 191, 79, 638, 89, 2, 14, 9, 8, 106, 607, 624, 35, 534, 6, 227, 7, 129, 113]) list([1, 4, 2, 2, 33, 2804, 4, 2040, 432, 111, 153, 103, 4, 1494, 13, 70, 131, 67, 11, 61, 2, 744, 35, 3715, 761, 61, 5766, 452, 9214, 4, 985, 7, 2, 59, 166, 4, 105, 216, 1239, 41, 1797, 9, 15, 7, 35, 744, 2413, 31, 8, 4, 687, 23, 4, 2, 7339, 6, 3693, 42, 38, 39, 121, 59, 456, 10, 10, 7, 265, 12, 575, 111, 153, 159, 59, 16, 1447, 21, 25, 586, 482, 39, 4, 96, 59, 716, 12, 4, 172, 65, 9, 579, 11, 6004, 4, 1615, 5, 2, 7, 5168, 17, 13, 7064, 12, 19, 6, 464, 31, 314, 11, 2, 6, 719, 605, 11, 8, 202, 27, 310, 4, 3772, 3501, 8, 2722, 58, 10, 10, 537, 2116, 180, 40, 14, 413, 173, 7, 263, 112, 37, 152, 377, 4, 537, 263, 846, 579, 178, 54, 75, 71, 476, 36, 413, 263, 2504, 182, 5, 17, 75, 2306, 922, 36, 279, 131, 2895, 17, 2867, 42, 17, 35, 921, 2, 192, 5, 1219, 3890, 19, 2, 217, 4122, 1710, 537, 2, 1236, 5, 736, 10, 10, 61, 403, 9, 2, 40, 61, 4494, 5, 27, 4494, 159, 90, 263, 2311, 4319, 309, 8, 178, 5, 82, 4319, 4, 65, 15, 9225, 145, 143, 5122, 12, 7039, 537, 746, 537, 537, 15, 7979, 4, 2, 594, 7, 5168, 94, 9096, 3987, 2, 11, 2, 4, 538, 7, 1795, 246, 2, 9, 2, 11, 635, 14, 9, 51, 408, 12, 94, 318, 1382, 12, 47, 6, 2683, 936, 5, 6307, 2, 19, 49, 7, 4, 1885, 2, 1118, 25, 80, 126, 842, 10, 10, 2, 2, 4726, 27, 4494, 11, 1550, 3633, 159, 27, 341, 29, 2733, 19, 4185, 173, 7, 90, 2, 8, 30, 11, 4, 1784, 86, 1117, 8, 3261, 46, 11, 2, 21, 29, 9, 2841, 23, 4, 1010, 2, 793, 6, 2, 1386, 1830, 10, 10, 246, 50, 9, 6, 2750, 1944, 746, 90, 
29, 2, 8, 124, 4, 882, 4, 882, 496, 27, 2, 2213, 537, 121, 127, 1219, 130, 5, 29, 494, 8, 124, 4, 882, 496, 4, 341, 7, 27, 846, 10, 10, 29, 9, 1906, 8, 97, 6, 236, 2, 1311, 8, 4, 2, 7, 31, 7, 2, 91, 2, 3987, 70, 4, 882, 30, 579, 42, 9, 12, 32, 11, 537, 10, 10, 11, 14, 65, 44, 537, 75, 2, 1775, 3353, 2, 1846, 4, 2, 7, 154, 5, 4, 518, 53, 2, 2, 7, 3211, 882, 11, 399, 38, 75, 257, 3807, 19, 2, 17, 29, 456, 4, 65, 7, 27, 205, 113, 10, 10, 2, 4, 2, 2, 9, 242, 4, 91, 1202, 2, 5, 2070, 307, 22, 7, 5168, 126, 93, 40, 2, 13, 188, 1076, 3222, 19, 4, 2, 7, 2348, 537, 23, 53, 537, 21, 82, 40, 2, 13, 2, 14, 280, 13, 219, 4, 2, 431, 758, 859, 4, 953, 1052, 2, 7, 5991, 5, 94, 40, 25, 238, 60, 2, 4, 2, 804, 2, 7, 4, 9941, 132, 8, 67, 6, 22, 15, 9, 283, 8, 5168, 14, 31, 9, 242, 955, 48, 25, 279, 2, 23, 12, 1685, 195, 25, 238, 60, 796, 2, 4, 671, 7, 2804, 5, 4, 559, 154, 888, 7, 726, 50, 26, 49, 7008, 15, 566, 30, 579, 21, 64, 2574]) list([1, 249, 1323, 7, 61, 113, 10, 10, 13, 1637, 14, 20, 56, 33, 2401, 18, 457, 88, 13, 2626, 1400, 45, 3171, 13, 70, 79, 49, 706, 919, 13, 16, 355, 340, 355, 1696, 96, 143, 4, 22, 32, 289, 7, 61, 369, 71, 2359, 5, 13, 16, 131, 2073, 249, 114, 249, 229, 249, 20, 13, 28, 126, 110, 13, 473, 8, 569, 61, 419, 56, 429, 6, 1513, 18, 35, 534, 95, 474, 570, 5, 25, 124, 138, 88, 12, 421, 1543, 52, 725, 6397, 61, 419, 11, 13, 1571, 15, 1543, 20, 11, 4, 2, 5, 296, 12, 3524, 5, 15, 421, 128, 74, 233, 334, 207, 126, 224, 12, 562, 298, 2167, 1272, 7, 2601, 5, 516, 988, 43, 8, 79, 120, 15, 595, 13, 784, 25, 3171, 18, 165, 170, 143, 19, 14, 5, 7224, 6, 226, 251, 7, 61, 113])]
max_len = 200
X_train = pad_sequences(X_train, maxlen = max_len)
X_test = pad_sequences(X_test, maxlen = max_len)
print('X_train의 크기(shape) :',X_train.shape)
print('X_test의 크기(shape) :',X_test.shape)
print(y_train[:5])
X_train의 크기(shape) : (25000, 200)
X_test의 크기(shape) : (25000, 200)
[1 0 0 1 0]
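With its defaults, `pad_sequences` pads short sequences with zeros at the front and, when a sequence is longer than `maxlen`, keeps only the last `maxlen` tokens. A pure-Python sketch of that behavior (the helper name is made up):

```python
def pad_pre(sequences, maxlen):
    # Mimics the keras pad_sequences defaults: padding='pre', truncating='pre'
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]                             # keep the last maxlen tokens
        padded.append([0] * (maxlen - len(seq)) + seq)  # zero-pad at the front
    return padded

print(pad_pre([[1, 2, 3], [4, 5, 6, 7, 8]], maxlen=4))
# [[0, 1, 2, 3], [5, 6, 7, 8]]
```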
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import load_model
embedding_dim = 256
batch_size = 256 # note: not passed to model.fit below, which uses the default batch size of 32
model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))
model.add(Dropout(0.3))
model.add(Conv1D(256, 3, padding='valid', activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 3)
mc = ModelCheckpoint('best_model.h5', monitor = 'val_acc', mode = 'max', verbose = 1, save_best_only = True)
model.compile(optimizer='adam', loss = 'binary_crossentropy', metrics = ['acc'])
history = model.fit(X_train, y_train, epochs = 20, validation_data = (X_test, y_test), callbacks=[es, mc])
loaded_model = load_model('best_model.h5')
print("\n 테스트 정확도: %.4f" % (loaded_model.evaluate(X_test, y_test)[1]))
782/782 [==============================] - 38s 48ms/step - loss: 0.2827 - acc: 0.8812 테스트 정확도: 0.8812
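In the model above, Conv1D produces one activation sequence per filter and GlobalMaxPooling1D keeps only each filter's strongest activation, so the sequence-length dimension disappears. A small numpy sketch of the pooling step, with made-up activations:

```python
import numpy as np

# Shape (batch, timesteps, filters): two filters over three time steps
feature_maps = np.array([[[0.1, 0.9],
                          [0.5, 0.2],
                          [0.3, 0.7]]])

# GlobalMaxPooling1D takes the maximum over the time axis
pooled = feature_maps.max(axis=1)
print(pooled)        # [[0.5 0.9]]
print(pooled.shape)  # (1, 2): the timestep dimension is gone
```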
3) Classifying Spam Mail with a 1D CNN
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
urllib.request.urlretrieve("https://raw.githubusercontent.com/mohitgupta-omg/Kaggle-SMS-Spam-Collection-Dataset-/master/spam.csv", filename="spam.csv")
data = pd.read_csv('spam.csv', encoding='latin-1')
print('총 샘플의 수 :',len(data))
data[:5]
총 샘플의 수 : 5572
| v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |
del data['Unnamed: 2']
del data['Unnamed: 3']
del data['Unnamed: 4']
data['v1'] = data['v1'].replace(['ham','spam'],[0,1])
data['v2'].nunique(), data['v1'].nunique()
data.drop_duplicates(subset=['v2'], inplace=True) # drop rows whose v2 text is duplicated
print('총 샘플의 수 :',len(data))
총 샘플의 수 : 5169
data['v1'].value_counts().plot(kind='bar');
print(data.groupby('v1').size().reset_index(name='count'))
   v1  count
0   0   4516
1   1    653
X_data = data['v2']
y_data = data['v1']
print('메일 본문의 개수: {}'.format(len(X_data)))
print('레이블의 개수: {}'.format(len(y_data)))
메일 본문의 개수: 5169
레이블의 개수: 5169
# Integer encoding
vocab_size = 1000
tokenizer = Tokenizer(num_words = vocab_size)
tokenizer.fit_on_texts(X_data) # tokenize each of the 5,169 rows of X_data
sequences = tokenizer.texts_to_sequences(X_data) # convert each word to its integer index
print(sequences[:5])
[[47, 433, 780, 705, 662, 64, 8, 94, 121, 434, 142, 68, 57, 137], [49, 306, 435, 6], [53, 537, 8, 20, 4, 934, 2, 220, 706, 267, 70, 2, 2, 359, 537, 604, 82, 436, 185, 707, 437], [6, 226, 152, 23, 347, 6, 138, 145, 56, 152], [935, 1, 97, 96, 69, 453, 2, 877, 69, 198, 105, 438]]
n_of_train = int(len(sequences) * 0.8)
n_of_test = int(len(sequences) - n_of_train)
print('훈련 데이터의 개수 :',n_of_train)
print('테스트 데이터의 개수:',n_of_test)
훈련 데이터의 개수 : 4135
테스트 데이터의 개수: 1034
X_data = sequences
print('메일의 최대 길이 : %d' % max(len(l) for l in X_data))
print('메일의 평균 길이 : %f' % (sum(map(len, X_data))/len(X_data)))
plt.hist([len(s) for s in X_data], bins=50)
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()
메일의 최대 길이 : 172
메일의 평균 길이 : 12.566841
# Pad the whole dataset to length max_len.
max_len = 172
data = pad_sequences(X_data, maxlen = max_len)
print("훈련 데이터의 크기(shape): ", data.shape)
훈련 데이터의 크기(shape): (5169, 172)
X_test = data[n_of_train:] # last 1,034 samples of the padded data
y_test = np.array(y_data[n_of_train:]) # last 1,034 labels
X_train = data[:n_of_train] # first 4,135 samples of the padded data
y_train = np.array(y_data[:n_of_train]) # first 4,135 labels
print("훈련용 이메일 데이터의 크기(shape): ", X_train.shape)
print("테스트용 이메일 데이터의 크기(shape): ", X_test.shape)
print("훈련용 레이블의 크기(shape): ", y_train.shape)
print("테스트용 레이블의 크기(shape): ", y_test.shape)
훈련용 이메일 데이터의 크기(shape): (4135, 172)
테스트용 이메일 데이터의 크기(shape): (1034, 172)
훈련용 레이블의 크기(shape): (4135,)
테스트용 레이블의 크기(shape): (1034,)
from tensorflow.keras.layers import Dense, Conv1D, GlobalMaxPooling1D, Embedding, Dropout, MaxPooling1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
model = Sequential()
model.add(Embedding(vocab_size, 32))
model.add(Dropout(0.2))
model.add(Conv1D(32, 5, strides=1, padding='valid', activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 3)
mc = ModelCheckpoint('best_model.h5', monitor = 'val_acc', mode = 'max', verbose = 1, save_best_only = True)
history = model.fit(X_train, y_train, epochs = 10, batch_size=64, validation_split=0.2, callbacks=[es, mc])
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, None, 32) 32000 _________________________________________________________________ dropout_2 (Dropout) (None, None, 32) 0 _________________________________________________________________ conv1d_1 (Conv1D) (None, None, 32) 5152 _________________________________________________________________ global_max_pooling1d_1 (Glob (None, 32) 0 _________________________________________________________________ dense_2 (Dense) (None, 64) 2112 _________________________________________________________________ dropout_3 (Dropout) (None, 64) 0 _________________________________________________________________ dense_3 (Dense) (None, 1) 65 ================================================================= Total params: 39,329 Trainable params: 39,329 Non-trainable params: 0 _________________________________________________________________ Epoch 1/10 50/52 [===========================>..] - ETA: 0s - loss: 0.4634 - acc: 0.8697 Epoch 00001: val_acc improved from -inf to 0.87304, saving model to best_model.h5 52/52 [==============================] - 1s 27ms/step - loss: 0.4612 - acc: 0.8694 - val_loss: 0.3812 - val_acc: 0.8730 Epoch 2/10 51/52 [============================>.] - ETA: 0s - loss: 0.3565 - acc: 0.8689 Epoch 00002: val_acc did not improve from 0.87304 52/52 [==============================] - 1s 25ms/step - loss: 0.3542 - acc: 0.8697 - val_loss: 0.2762 - val_acc: 0.8730 Epoch 3/10 51/52 [============================>.] - ETA: 0s - loss: 0.1499 - acc: 0.9406 Epoch 00003: val_acc improved from 0.87304 to 0.98549, saving model to best_model.h5 52/52 [==============================] - 1s 26ms/step - loss: 0.1496 - acc: 0.9407 - val_loss: 0.0781 - val_acc: 0.9855 Epoch 4/10 51/52 [============================>.] 
- ETA: 0s - loss: 0.0528 - acc: 0.9881 Epoch 00004: val_acc did not improve from 0.98549 52/52 [==============================] - 1s 25ms/step - loss: 0.0530 - acc: 0.9876 - val_loss: 0.0559 - val_acc: 0.9843 Epoch 5/10 51/52 [============================>.] - ETA: 0s - loss: 0.0346 - acc: 0.9911 Epoch 00005: val_acc did not improve from 0.98549 52/52 [==============================] - 1s 25ms/step - loss: 0.0344 - acc: 0.9912 - val_loss: 0.0529 - val_acc: 0.9807 Epoch 6/10 50/52 [===========================>..] - ETA: 0s - loss: 0.0218 - acc: 0.9928 Epoch 00006: val_acc did not improve from 0.98549 52/52 [==============================] - 1s 26ms/step - loss: 0.0213 - acc: 0.9930 - val_loss: 0.0516 - val_acc: 0.9807 Epoch 7/10 51/52 [============================>.] - ETA: 0s - loss: 0.0139 - acc: 0.9969 Epoch 00007: val_acc did not improve from 0.98549 52/52 [==============================] - 1s 26ms/step - loss: 0.0138 - acc: 0.9970 - val_loss: 0.0530 - val_acc: 0.9807 Epoch 8/10 51/52 [============================>.] - ETA: 0s - loss: 0.0092 - acc: 0.9982 Epoch 00008: val_acc did not improve from 0.98549 52/52 [==============================] - 1s 26ms/step - loss: 0.0091 - acc: 0.9982 - val_loss: 0.0533 - val_acc: 0.9807 Epoch 9/10 51/52 [============================>.] - ETA: 0s - loss: 0.0081 - acc: 0.9982 Epoch 00009: val_acc did not improve from 0.98549 52/52 [==============================] - 1s 27ms/step - loss: 0.0080 - acc: 0.9982 - val_loss: 0.0542 - val_acc: 0.9807 Epoch 00009: early stopping
print("\n 테스트 정확도: %.4f" % (model.evaluate(X_test, y_test)[1]))
33/33 [==============================] - 0s 3ms/step - loss: 0.0706 - acc: 0.9787 테스트 정확도: 0.9787
4) Classifying Naver Movie Reviews with a Multi-Kernel 1D CNN
- Note: the code below reuses vocab_size, max_len, the tokenizer, and the X/y splits prepared in the spam-mail section above.
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense, Input, Flatten, Concatenate
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import load_model
# Define the hyperparameters
embedding_dim = 128
dropout_prob = (0.5, 0.8)
num_filters = 128
# Input layer and embedding layer -> 50% dropout
model_input = Input(shape = (max_len,))
z = Embedding(vocab_size, embedding_dim, input_length = max_len, name="embedding")(model_input)
z = Dropout(dropout_prob[0])(z)
# Convolution + global max pooling blocks, one per kernel size
conv_blocks = []
for sz in [3, 4, 5]:
    conv = Conv1D(filters = num_filters,
                  kernel_size = sz,
                  padding = "valid",
                  activation = "relu",
                  strides = 1)(z)
    conv = GlobalMaxPooling1D()(conv)
    conv = Flatten()(conv)
    conv_blocks.append(conv)
# Concatenate the blocks and connect to the dense layers
z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]
z = Dropout(dropout_prob[1])(z)
z = Dense(128, activation="relu")(z)
model_output = Dense(1, activation="sigmoid")(z)
model = Model(model_input, model_output)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["acc"])
# Train the binary classifier
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=4)
mc = ModelCheckpoint('CNN_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
model.fit(X_train, y_train, batch_size = 64, epochs=10, validation_data = (X_test, y_test), verbose=2, callbacks=[es, mc])
# Load the best model and evaluate
loaded_model = load_model('CNN_model.h5')
print("\n 테스트 정확도: %.4f" % (loaded_model.evaluate(X_test, y_test)[1]))
33/33 [==============================] - 1s 39ms/step - loss: 0.3654 - acc: 0.8868 테스트 정확도: 0.8868
from konlpy.tag import Okt
okt = Okt()
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords # needed by sentiment_predict below
[nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Unzipping corpora/stopwords.zip.
True
def sentiment_predict(new_sentence):
    new_sentence = okt.morphs(new_sentence, stem=True) # tokenize into morphemes
    # NOTE: removing English stopwords has no effect on Korean tokens, and the
    # tokenizer was fit on the English SMS data above, so most Korean tokens are OOV;
    # this explains the implausible prediction below.
    new_sentence = [word for word in new_sentence if not word in stopwords.words('english')] # remove stopwords
    encoded = tokenizer.texts_to_sequences([new_sentence]) # integer encoding
    pad_new = pad_sequences(encoded, maxlen = max_len) # padding
    score = float(model.predict(pad_new)) # predict
    if(score > 0.5):
        print("{:.2f}% 확률로 긍정 리뷰입니다.\n".format(score * 100))
    else:
        print("{:.2f}% 확률로 부정 리뷰입니다.\n".format((1 - score) * 100))
sentiment_predict('이 영화 개꿀잼 ㅋㅋㅋ')
86.70% 확률로 부정 리뷰입니다.
5) Intent Classification with Pre-trained Word Embeddings
- Intent classification, along with named-entity recognition, is used as a key module of chatbots.
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import classification_report
import urllib.request
urllib.request.urlretrieve("https://github.com/ajinkyaT/CNN_Intent_Classification/raw/master/data/train_text.npy", filename="train_text.npy")
urllib.request.urlretrieve("https://github.com/ajinkyaT/CNN_Intent_Classification/raw/master/data/test_text.npy", filename="test_text.npy")
urllib.request.urlretrieve("https://github.com/ajinkyaT/CNN_Intent_Classification/raw/master/data/train_label.npy", filename="train_label.npy")
urllib.request.urlretrieve("https://github.com/ajinkyaT/CNN_Intent_Classification/raw/master/data/test_label.npy", filename="test_label.npy")
('test_label.npy', <http.client.HTTPMessage at 0x7f604dd1d6a0>)
# These .npy files store pickled Python objects, so allow_pickle=True is required.
intent_train = np.load('train_text.npy', allow_pickle=True).tolist()
label_train = np.load('train_label.npy', allow_pickle=True).tolist()
intent_test = np.load('test_text.npy', allow_pickle=True).tolist()
label_test = np.load('test_label.npy', allow_pickle=True).tolist()
print('훈련용 문장의 수 :', len(intent_train))
print('훈련용 레이블의 수 :', len(label_train))
print('테스트용 문장의 수 :', len(intent_test))
print('테스트용 레이블의 수 :', len(label_test))
훈련용 문장의 수 : 11784
훈련용 레이블의 수 : 11784
테스트용 문장의 수 : 600
테스트용 레이블의 수 : 600
print(intent_train[:5])
print(label_train[:5])
print(intent_train[2000:2002])
print(label_train[2000:2002])
print(intent_train[4000:4002])
print(label_train[4000:4002])
print(intent_train[6000:6002])
print(label_train[6000:6002])
print(intent_train[8000:8002])
print(label_train[8000:8002])
print(intent_train[10000:10002])
print(label_train[10000:10002])
['add another song to the cita rom ntica playlist', 'add clem burke in my playlist pre party r b jams', 'add live from aragon ballroom to trapeo', 'add unite and win to my night out', 'add track to my digster future hits']
['AddToPlaylist', 'AddToPlaylist', 'AddToPlaylist', 'AddToPlaylist', 'AddToPlaylist']
['please book reservations for 3 people at a restaurant in alderwood manor', 'book a table in mt for 3 for now at a pub that serves south indian']
['BookRestaurant', 'BookRestaurant']
['what will the weather be like on feb 8 , 2034 in cedar mountain wilderness', "tell me the forecast in the same area here on robert e lee 's birthday"]
['GetWeather', 'GetWeather']
['rate the current album one points', 'i give a zero rating for this essay']
['RateBook', 'RateBook']
["i'm trying to find the show chant ii", 'find spirit of the bush']
['SearchCreativeWork', 'SearchCreativeWork']
['when is blood and ice cream trilogie playing at the nearest movie theatre \\?', 'show movie schedules']
['SearchScreeningEvent', 'SearchScreeningEvent']
temp = pd.Series(label_train)
temp.value_counts().plot(kind = 'bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f604dd94278>
# Label encoding: assign each label a unique integer
idx_encode = preprocessing.LabelEncoder()
idx_encode.fit(label_train)
label_train = idx_encode.transform(label_train) # convert labels to their integer ids
label_test = idx_encode.transform(label_test) # convert labels to their integer ids
label_idx = dict(zip(list(idx_encode.classes_), idx_encode.transform(list(idx_encode.classes_))))
print(label_idx)
{'AddToPlaylist': 0, 'BookRestaurant': 1, 'GetWeather': 2, 'RateBook': 3, 'SearchCreativeWork': 4, 'SearchScreeningEvent': 5}
print(intent_train[:5])
print(label_train[:5])
print(intent_test[:5])
print(label_test[:5])
['add another song to the cita rom ntica playlist', 'add clem burke in my playlist pre party r b jams', 'add live from aragon ballroom to trapeo', 'add unite and win to my night out', 'add track to my digster future hits']
[0 0 0 0 0]
["i 'd like to have this track onto my classical relaxations playlist", 'add the album to my flow espa ol playlist', 'add digging now to my young at heart playlist', 'add this song by too poetic to my piano ballads playlist', 'add this album to old school death metal']
[0 0 0 0 0]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(intent_train)
sequences = tokenizer.texts_to_sequences(intent_train)
sequences[:5] # print the first 5 samples
[[11, 191, 61, 4, 1, 4013, 1141, 1572, 15], [11, 2624, 1573, 3, 14, 15, 939, 82, 256, 188, 548], [11, 187, 42, 2625, 4014, 4, 1968], [11, 2626, 22, 2627, 4, 14, 192, 27], [11, 92, 4, 14, 651, 520, 195]]
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1
print('단어 집합(Vocabulary)의 크기 :',vocab_size)
단어 집합(Vocabulary)의 크기 : 9870
print('문장의 최대 길이 :',max(len(l) for l in sequences))
print('문장의 평균 길이 :',sum(map(len, sequences))/len(sequences))
plt.hist([len(s) for s in sequences], bins=50)
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()
문장의 최대 길이 : 35
문장의 평균 길이 : 9.364392396469789
max_len = 35
intent_train = pad_sequences(sequences, maxlen = max_len)
label_train = to_categorical(np.asarray(label_train))
print('전체 데이터의 크기(shape):', intent_train.shape)
print('레이블 데이터의 크기(shape):', label_train.shape)
전체 데이터의 크기(shape): (11784, 35)
레이블 데이터의 크기(shape): (11784, 6)
print(intent_train[0])
print(label_train[0])
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 11 191
61 4 1 4013 1141 1572 15]
[1. 0. 0. 0. 0. 0.]
indices = np.arange(intent_train.shape[0])
np.random.shuffle(indices)
print(indices)
intent_train = intent_train[indices]
label_train = label_train[indices]
n_of_val = int(0.1 * intent_train.shape[0])
print(n_of_val)
[3107 310 9113 ... 9998 3835 4179]
1178
X_train = intent_train[:-n_of_val]
y_train = label_train[:-n_of_val]
X_val = intent_train[-n_of_val:]
y_val = label_train[-n_of_val:]
X_test = intent_test
y_test = label_test
print('훈련 데이터의 크기(shape):', X_train.shape)
print('검증 데이터의 크기(shape):', X_val.shape)
print('훈련 데이터 레이블의 개수(shape):', y_train.shape)
print('검증 데이터 레이블의 개수(shape):', y_val.shape)
print('테스트 데이터의 개수 :', len(X_test))
print('테스트 데이터 레이블의 개수 :', len(y_test))
훈련 데이터의 크기(shape): (10606, 35)
검증 데이터의 크기(shape): (1178, 35)
훈련 데이터 레이블의 개수(shape): (10606, 6)
검증 데이터 레이블의 개수(shape): (1178, 6)
테스트 데이터의 개수 : 600
테스트 데이터 레이블의 개수 : 600
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip
--2020-08-23 15:11:00-- http://nlp.stanford.edu/data/glove.6B.zip Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140 Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected. HTTP request sent, awaiting response... 302 Found Location: https://nlp.stanford.edu/data/glove.6B.zip [following] --2020-08-23 15:11:00-- https://nlp.stanford.edu/data/glove.6B.zip Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected. HTTP request sent, awaiting response... 301 Moved Permanently Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following] --2020-08-23 15:11:01-- http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22 Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 862182613 (822M) [application/zip] Saving to: ‘glove.6B.zip’ glove.6B.zip 100%[===================>] 822.24M 2.09MB/s in 6m 28s 2020-08-23 15:17:29 (2.12 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613] Archive: glove.6B.zip inflating: glove.6B.50d.txt inflating: glove.6B.100d.txt inflating: glove.6B.200d.txt inflating: glove.6B.300d.txt
embedding_dict = dict()
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        word_vector = line.split()
        word = word_vector[0]
        word_vector_arr = np.asarray(word_vector[1:], dtype='float32') # 100-dimensional vector
        embedding_dict[word] = word_vector_arr
print('%s개의 Embedding vector가 있습니다.' % len(embedding_dict))
400000개의 Embedding vector가 있습니다.
print(embedding_dict['respectable'])
print(len(embedding_dict['respectable']))
[-0.049773 0.19903 0.10585 0.1391 -0.32395 0.44053 0.3947 -0.22805 -0.25793 0.49768 0.15384 -0.08831 0.0782 -0.8299 -0.037788 0.16772 -0.45197 -0.17085 0.74756 0.98256 0.81872 0.28507 0.16178 -0.48626 -0.006265 -0.92469 -0.30625 -0.067318 -0.046762 -0.76291 -0.0025264 -0.018795 0.12882 -0.52457 0.3586 0.43119 -0.89477 -0.057421 -0.53724 0.25587 0.55195 0.44698 -0.24252 0.29946 0.25776 -0.8717 0.68426 -0.05688 -0.1848 -0.59352 -0.11227 -0.57692 -0.013593 0.18488 -0.32507 -0.90171 0.17672 0.075601 0.54896 -0.21488 -0.54018 -0.45882 -0.79536 0.26331 0.18879 -0.16363 0.3975 0.1099 0.1164 -0.083499 0.50159 0.35802 0.25677 0.088546 0.42108 0.28674 -0.71285 -0.82915 0.15297 -0.82712 0.022112 1.067 -0.31776 0.1211 -0.069755 -0.61327 0.27308 -0.42638 -0.085084 -0.17694 -0.0090944 0.1109 0.62543 -0.23682 -0.44928 -0.3667 -0.21616 -0.19187 -0.032502 0.38025 ] 100
embedding_dim = 100
embedding_matrix = np.zeros((vocab_size, embedding_dim))
np.shape(embedding_matrix)
(9870, 100)
for word, i in word_index.items():
    embedding_vector = embedding_dict.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
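The loop above can be checked on a toy vocabulary; the words and 2-dimensional vectors below are made up, standing in for GloVe and the Keras `word_index`:

```python
import numpy as np

embedding_dict = {'good': np.array([0.1, 0.2]),   # hypothetical pre-trained vectors
                  'bad':  np.array([0.3, 0.4])}
word_index = {'good': 1, 'oovword': 2, 'bad': 3}  # index 0 is reserved for padding

embedding_matrix = np.zeros((len(word_index) + 1, 2))
for word, i in word_index.items():
    vector = embedding_dict.get(word)
    if vector is not None:   # words missing from the pre-trained set keep an all-zero row
        embedding_matrix[i] = vector

print(embedding_matrix[2])  # [0. 0.] -- 'oovword' has no pre-trained vector
```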
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense, Input, Flatten, Concatenate
filter_sizes = [2,3,5]
num_filters = 512
drop = 0.5
model_input = Input(shape = (max_len,))
z = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix],
              input_length=max_len, trainable=False)(model_input)
conv_blocks = []
for sz in filter_sizes:
    conv = Conv1D(filters = num_filters,
                  kernel_size = sz,
                  padding = "valid",
                  activation = "relu",
                  strides = 1)(z)
    conv = GlobalMaxPooling1D()(conv)
    conv = Flatten()(conv)
    conv_blocks.append(conv)
z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]
z = Dropout(drop)(z)
model_output = Dense(len(label_idx), activation='softmax')(z)
model = Model(model_input, model_output)
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['acc'])
model.summary()
Model: "functional_3"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_2 (InputLayer) [(None, 35)] 0
__________________________________________________________________________________________________
embedding_2 (Embedding) (None, 35, 100) 987000 input_2[0][0]
__________________________________________________________________________________________________
conv1d_5 (Conv1D) (None, 34, 512) 102912 embedding_2[0][0]
__________________________________________________________________________________________________
conv1d_6 (Conv1D) (None, 33, 512) 154112 embedding_2[0][0]
__________________________________________________________________________________________________
conv1d_7 (Conv1D) (None, 31, 512) 256512 embedding_2[0][0]
__________________________________________________________________________________________________
global_max_pooling1d_5 (GlobalM (None, 512) 0 conv1d_5[0][0]
__________________________________________________________________________________________________
global_max_pooling1d_6 (GlobalM (None, 512) 0 conv1d_6[0][0]
__________________________________________________________________________________________________
global_max_pooling1d_7 (GlobalM (None, 512) 0 conv1d_7[0][0]
__________________________________________________________________________________________________
flatten_3 (Flatten) (None, 512) 0 global_max_pooling1d_5[0][0]
__________________________________________________________________________________________________
flatten_4 (Flatten) (None, 512) 0 global_max_pooling1d_6[0][0]
__________________________________________________________________________________________________
flatten_5 (Flatten) (None, 512) 0 global_max_pooling1d_7[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 1536) 0 flatten_3[0][0]
flatten_4[0][0]
flatten_5[0][0]
__________________________________________________________________________________________________
dropout_6 (Dropout) (None, 1536) 0 concatenate_1[0][0]
__________________________________________________________________________________________________
dense_6 (Dense) (None, 6) 9222 dropout_6[0][0]
==================================================================================================
Total params: 1,509,758
Trainable params: 522,758
Non-trainable params: 987,000
__________________________________________________________________________________________________
history = model.fit(X_train, y_train,
                    batch_size=64,
                    epochs=10,
                    validation_data = (X_val, y_val))
epochs = range(1, len(history.history['acc']) + 1)
plt.plot(epochs, history.history['acc'])
plt.plot(epochs, history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epochs')
plt.legend(['train', 'validation'], loc='lower right')
plt.show()
epochs = range(1, len(history.history['loss']) + 1)
plt.plot(epochs, history.history['loss'])
plt.plot(epochs, history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epochs')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()
X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=max_len)
y_predicted = model.predict(X_test)
y_predicted = y_predicted.argmax(axis=-1) # pick the class with the highest probability
y_predicted = idx_encode.inverse_transform(y_predicted) # map integer ids back to label names
y_test = idx_encode.inverse_transform(y_test) # map integer ids back to label names
print('accuracy: ', sum(y_predicted == y_test) / len(y_test))
print("Precision, Recall and F1-Score:\n\n", classification_report(y_test, y_predicted))
accuracy: 0.98
Precision, Recall and F1-Score:
precision recall f1-score support
AddToPlaylist 1.00 1.00 1.00 100
BookRestaurant 1.00 1.00 1.00 100
GetWeather 0.99 0.99 0.99 100
RateBook 1.00 1.00 1.00 100
SearchCreativeWork 0.91 1.00 0.95 100
SearchScreeningEvent 0.99 0.89 0.94 100
accuracy 0.98 600
macro avg 0.98 0.98 0.98 600
weighted avg 0.98 0.98 0.98 600
태깅작업¶
케라스를 이용한 태깅작업 개요¶
- 개체명 인식기, 품사 태거 생성에 사용
- Bidirectional RNN(many-to-many)을 이용한 작업
태깅작업은 대표적인 시퀀스 레이블링임.
- X = [x1, x2, x3, ..., xn] -> y = [y1, y2, y3, ..., yn]
양방향 LSTM
- 이전 시점의 단어 정보뿐만 아니라, 다음 시점의 단어 정보도 참고하기 위함임.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
[nltk_data] Downloading package punkt to /root/nltk_data... [nltk_data] Package punkt is already up-to-date! [nltk_data] Downloading package averaged_perceptron_tagger to [nltk_data] /root/nltk_data... [nltk_data] Package averaged_perceptron_tagger is already up-to- [nltk_data] date! [nltk_data] Downloading package maxent_ne_chunker to [nltk_data] /root/nltk_data... [nltk_data] Package maxent_ne_chunker is already up-to-date! [nltk_data] Downloading package words to /root/nltk_data... [nltk_data] Unzipping corpora/words.zip.
True
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "James is working at Disney in London"
sentence=pos_tag(word_tokenize(sentence))
print(sentence) # 토큰화와 품사 태깅을 동시 수행
[('James', 'NNP'), ('is', 'VBZ'), ('working', 'VBG'), ('at', 'IN'), ('Disney', 'NNP'), ('in', 'IN'), ('London', 'NNP')]
sentence=ne_chunk(sentence)
print(sentence) # 개체명 인식
(S (PERSON James/NNP) is/VBZ working/VBG at/IN (ORGANIZATION Disney/NNP) in/IN (GPE London/NNP))
Named Entity Recognition using Bi-LSTM¶
- BIO 표현 (Begin Inside Outside)
- B와 I는 개체명을 위해 사용, O는 개체명이 아니라는 것을 의미함.
해 B 리 I 포 I 터 I 보 O 러 O 가 O 자 O
해 B-movie 리 I-movie 포 I-movie 터 I-movie 보 O 러 O 메 B-theater 가 I-theater 박 I-theater 스 I-theater 가 O 자 O
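위와 같은 BIO 표기는 개체명 구간 정보만 있으면 기계적으로 만들 수 있다. 아래는 (시작, 끝, 개체명 종류) 형태의 구간 목록을 BIO 레이블로 바꿔 보는 간단한 스케치이다. (`to_bio`라는 함수 이름과 구간 표현 방식은 설명을 위해 임의로 정한 것)

```python
def to_bio(tokens, spans):
    # spans: (시작 인덱스, 끝 인덱스(미포함), 개체명 종류)의 목록 -- 설명용으로 임의로 정한 형식
    labels = ['O'] * len(tokens)          # 기본값은 개체명이 아님(O)
    for start, end, etype in spans:
        labels[start] = 'B-' + etype      # 개체명의 시작은 B-
        for i in range(start + 1, end):
            labels[i] = 'I-' + etype      # 개체명의 내부는 I-
    return labels

tokens = list("해리포터보러가자")
print(to_bio(tokens, [(0, 4, 'movie')]))
# ['B-movie', 'I-movie', 'I-movie', 'I-movie', 'O', 'O', 'O', 'O']
```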
품사 정보
import re
%matplotlib inline
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import numpy as np
# 데이터 전처리
f = open('/content/train.txt', 'r')
tagged_sentences = []
sentence = []
for line in f:
    if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == "\n":
        if len(sentence) > 0:
            tagged_sentences.append(sentence)
            sentence = []
        continue
    splits = line.split(' ') # 공백을 기준으로 속성을 구분한다.
    splits[-1] = re.sub(r'\n', '', splits[-1]) # 줄바꿈 표시 \n을 제거한다.
    word = splits[0].lower() # 단어들은 소문자로 바꿔서 저장한다.
    sentence.append([word, splits[-1]]) # 단어와 개체명 태깅만 기록한다.
print("전체 샘플 개수: ", len(tagged_sentences)) # 전체 샘플의 개수 출력
print(tagged_sentences[0]) # 첫번째 샘플 출력
전체 샘플 개수:  8415
[['eu', 'B-ORG'], ['rejects', 'O'], ['german', 'B-MISC'], ['call', 'O'], ['to', 'O'], ['boycott', 'O'], ['british', 'B-MISC'], ['lamb', 'O'], ['.', 'O']]
sentences, ner_tags = [], []
for tagged_sentence in tagged_sentences: # 8,415개의 문장 샘플을 1개씩 불러온다.
    sentence, tag_info = zip(*tagged_sentence) # 각 샘플에서 단어들은 sentence에, 개체명 태깅 정보들은 tag_info에 저장.
    sentences.append(list(sentence)) # 각 샘플에서 단어 정보만 저장한다.
    ner_tags.append(list(tag_info)) # 각 샘플에서 개체명 태깅 정보만 저장한다.
# 첫번째 문장 샘플 출력
print(sentences[0])
print(ner_tags[0])
# 열세번째 문장 샘플 출력
print(sentences[12])
print(ner_tags[12])
['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.']
['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
['only', 'france', 'and', 'britain', 'backed', 'fischler', "'s", 'proposal', '.']
['O', 'B-LOC', 'O', 'B-LOC', 'O', 'B-PER', 'O', 'O', 'O']
print('샘플의 최대 길이 : %d' % max(len(l) for l in sentences))
print('샘플의 평균 길이 : %f' % (sum(map(len, sentences))/len(sentences)))
plt.hist([len(s) for s in sentences], bins=50)
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()
샘플의 최대 길이 : 60
샘플의 평균 길이 : 13.444801
max_words = 4000
src_tokenizer = Tokenizer(num_words=max_words, oov_token='OOV')
src_tokenizer.fit_on_texts(sentences)
tar_tokenizer = Tokenizer()
tar_tokenizer.fit_on_texts(ner_tags)
vocab_size = max_words
tag_size = len(tar_tokenizer.word_index) + 1
print('단어 집합의 크기 : {}'.format(vocab_size))
print('개체명 태깅 정보 집합의 크기 : {}'.format(tag_size))
# 정수 인코딩
X_train = src_tokenizer.texts_to_sequences(sentences)
y_train = tar_tokenizer.texts_to_sequences(ner_tags)
print(X_train[0])
print(y_train[0])
단어 집합의 크기 : 4000
개체명 태깅 정보 집합의 크기 : 10
[1190, 1, 199, 814, 9, 1, 262, 3734, 3]
[3, 1, 7, 1, 1, 1, 7, 1, 1]
# 디코딩 (정수->텍스트)
index_to_word = src_tokenizer.index_word
index_to_ner = tar_tokenizer.index_word
decoded = []
for index in X_train[0]: # 첫번째 샘플 안의 인덱스들에 대해서
    decoded.append(index_to_word[index]) # 다시 단어로 변환
print('기존 문장 : {}'.format(sentences[0]))
print('빈도수가 낮은 단어가 OOV 처리된 문장 : {}'.format(decoded))
기존 문장 : ['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.']
빈도수가 낮은 단어가 OOV 처리된 문장 : ['eu', 'OOV', 'german', 'call', 'to', 'OOV', 'british', 'lamb', '.']
# 패딩
max_len = 70
X_train = pad_sequences(X_train, padding='post', maxlen=max_len)
# X_train의 모든 샘플들의 길이를 맞출 때 뒤의 공간에 숫자 0으로 채움.
y_train = pad_sequences(y_train, padding='post', maxlen=max_len)
# y_train의 모든 샘플들의 길이를 맞출 때 뒤의 공간에 숫자0으로 채움.
# testset, trainset 분리
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=.2, random_state=777)
# 원-핫인코딩
y_train = to_categorical(y_train, num_classes=tag_size)
y_test = to_categorical(y_test, num_classes=tag_size)
print('훈련 샘플 문장의 크기 : {}'.format(X_train.shape))
print('훈련 샘플 레이블의 크기 : {}'.format(y_train.shape))
print('테스트 샘플 문장의 크기 : {}'.format(X_test.shape))
print('테스트 샘플 레이블의 크기 : {}'.format(y_test.shape))
훈련 샘플 문장의 크기 : (6732, 70)
훈련 샘플 레이블의 크기 : (6732, 70, 10)
테스트 샘플 문장의 크기 : (1683, 70)
테스트 샘플 레이블의 크기 : (1683, 70, 10)
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Bidirectional, TimeDistributed
from keras.optimizers import Adam
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=128, input_length=max_len, mask_zero=True))
model.add(Bidirectional(LSTM(256, return_sequences=True))) # many-to-many이기 때문에 return_sequences=True
model.add(TimeDistributed(Dense(tag_size, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=128, epochs=8, validation_data=(X_test, y_test))
print("\n 테스트 정확도: %.4f" % (model.evaluate(X_test, y_test)[1]))
14/14 [==============================] - 5s 335ms/step - loss: 0.1408 - accuracy: 0.8274
 테스트 정확도: 0.8274
i=10 # 확인하고 싶은 테스트용 샘플의 인덱스.
y_predicted = model.predict(np.array([X_test[i]])) # 입력한 테스트용 샘플에 대해서 예측 y를 리턴
y_predicted = np.argmax(y_predicted, axis=-1) # 원-핫 인코딩을 다시 정수 인코딩으로 변경함.
true = np.argmax(y_test[i], -1) # 원-핫 인코딩을 다시 정수 인코딩으로 변경함.
print("{:15}|{:5}|{}".format("단어", "실제값", "예측값"))
print(35 * "-")
for w, t, pred in zip(X_test[i], true, y_predicted[0]):
    if w != 0: # PAD값은 제외함.
        print("{:17}: {:7} {}".format(index_to_word[w], index_to_ner[t].upper(), index_to_ner[pred].upper()))
단어           |실제값  |예측값
-----------------------------------
-                : O       O
OOV              : O       O
iraq             : B-LOC   O
for              : O       O
riots            : O       O
in               : O       O
jordan           : B-LOC   O
is               : O       O
a                : O       O
OOV              : O       O
game             : O       O
.                : O       O
Bi-LSTM을 이용한 품사태깅¶
nltk.download('treebank')
[nltk_data] Downloading package treebank to /root/nltk_data... [nltk_data] Unzipping corpora/treebank.zip.
True
import nltk
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
tagged_sentences = nltk.corpus.treebank.tagged_sents() # 토큰화에 품사 태깅이 된 데이터 받아오기
print("품사 태깅이 된 문장 개수: ", len(tagged_sentences)) # 문장 샘플의 개수 출력
품사 태깅이 된 문장 개수: 3914
print(tagged_sentences[0]) # 첫번째 샘플 출력
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
sentences, pos_tags = [], []
for tagged_sentence in tagged_sentences: # 3,914개의 문장 샘플을 1개씩 불러온다.
    sentence, tag_info = zip(*tagged_sentence) # 각 샘플에서 단어들은 sentence에, 품사 태깅 정보들은 tag_info에 저장한다.
    sentences.append(list(sentence)) # 각 샘플에서 단어 정보만 저장한다.
    pos_tags.append(list(tag_info)) # 각 샘플에서 품사 태깅 정보만 저장한다.
print(sentences[0])
print(pos_tags[0])
print(sentences[8])
print(pos_tags[8])
print('샘플의 최대 길이 : %d' % max(len(l) for l in sentences))
print('샘플의 평균 길이 : %f' % (sum(map(len, sentences))/len(sentences)))
plt.hist([len(s) for s in sentences], bins=50)
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
['NNP', 'NNP', ',', 'CD', 'NNS', 'JJ', ',', 'MD', 'VB', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', 'NNP', 'CD', '.']
['We', "'re", 'talking', 'about', 'years', 'ago', 'before', 'anyone', 'heard', 'of', 'asbestos', 'having', 'any', 'questionable', 'properties', '.']
['PRP', 'VBP', 'VBG', 'IN', 'NNS', 'IN', 'IN', 'NN', 'VBD', 'IN', 'NN', 'VBG', 'DT', 'JJ', 'NNS', '.']
샘플의 최대 길이 : 271
샘플의 평균 길이 : 25.722024
def tokenize(samples):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(samples)
    return tokenizer
src_tokenizer = tokenize(sentences)
tar_tokenizer = tokenize(pos_tags)
vocab_size = len(src_tokenizer.word_index) + 1
tag_size = len(tar_tokenizer.word_index) + 1
print('단어 집합의 크기 : {}'.format(vocab_size))
print('태깅 정보 집합의 크기 : {}'.format(tag_size))
X_train = src_tokenizer.texts_to_sequences(sentences)
y_train = tar_tokenizer.texts_to_sequences(pos_tags)
print(X_train[:2])
print(y_train[:2])
max_len = 150
X_train = pad_sequences(X_train, padding='post', maxlen=max_len)
# X_train의 모든 샘플의 길이를 맞출 때 뒤의 공간에 숫자 0으로 채움.
y_train = pad_sequences(y_train, padding='post', maxlen=max_len)
# y_train의 모든 샘플의 길이를 맞출 때 뒤의 공간에 숫자 0으로 채움.
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=.2, random_state=777)
y_train = to_categorical(y_train, num_classes=tag_size)
y_test = to_categorical(y_test, num_classes=tag_size)
print('훈련 샘플 문장의 크기 : {}'.format(X_train.shape))
print('훈련 샘플 레이블의 크기 : {}'.format(y_train.shape))
print('테스트 샘플 문장의 크기 : {}'.format(X_test.shape))
print('테스트 샘플 레이블의 크기 : {}'.format(y_test.shape))
단어 집합의 크기 : 11388
태깅 정보 집합의 크기 : 47
[[5601, 3746, 1, 2024, 86, 331, 1, 46, 2405, 2, 131, 27, 6, 2025, 332, 459, 2026, 3], [31, 3746, 20, 177, 4, 5602, 2915, 1, 2, 2916, 637, 147, 3]]
[[3, 3, 8, 10, 6, 7, 8, 21, 13, 4, 1, 2, 4, 7, 1, 3, 10, 9], [3, 3, 17, 1, 2, 3, 3, 8, 4, 3, 19, 1, 9]]
훈련 샘플 문장의 크기 : (3131, 150)
훈련 샘플 레이블의 크기 : (3131, 150, 47)
테스트 샘플 문장의 크기 : (783, 150)
테스트 샘플 레이블의 크기 : (783, 150, 47)
from keras.models import Sequential
from keras.layers import Dense, LSTM, InputLayer, Bidirectional, TimeDistributed, Embedding
from keras.optimizers import Adam
model = Sequential()
model.add(Embedding(vocab_size, 128, input_length=max_len, mask_zero=True))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(TimeDistributed(Dense(tag_size, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=128, epochs=6, validation_data=(X_test, y_test))
Epoch 1/6
25/25 [==============================] - 72s 3s/step - loss: 0.5738 - accuracy: 0.1383 - val_loss: 0.5069 - val_accuracy: 0.1691
Epoch 2/6
25/25 [==============================] - 70s 3s/step - loss: 0.4924 - accuracy: 0.2307 - val_loss: 0.4630 - val_accuracy: 0.3586
Epoch 3/6
25/25 [==============================] - 70s 3s/step - loss: 0.4119 - accuracy: 0.4305 - val_loss: 0.3302 - val_accuracy: 0.5008
Epoch 4/6
25/25 [==============================] - 70s 3s/step - loss: 0.2624 - accuracy: 0.6022 - val_loss: 0.1979 - val_accuracy: 0.7050
Epoch 5/6
25/25 [==============================] - 70s 3s/step - loss: 0.1459 - accuracy: 0.8095 - val_loss: 0.1109 - val_accuracy: 0.8551
Epoch 6/6
25/25 [==============================] - 70s 3s/step - loss: 0.0772 - accuracy: 0.9050 - val_loss: 0.0721 - val_accuracy: 0.8940
<tensorflow.python.keras.callbacks.History at 0x7feb5cf50f98>
print("\n 테스트 정확도: %.4f" % (model.evaluate(X_test, y_test)[1]))
index_to_word=src_tokenizer.index_word
index_to_tag=tar_tokenizer.index_word
i=10 # 확인하고 싶은 테스트용 샘플의 인덱스.
y_predicted = model.predict(np.array([X_test[i]])) # 입력한 테스트용 샘플에 대해서 예측 y를 리턴
y_predicted = np.argmax(y_predicted, axis=-1) # 원-핫 인코딩을 다시 정수 인코딩으로 변경함.
true = np.argmax(y_test[i], -1) # 원-핫 인코딩을 다시 정수 인코딩으로 변경함.
print("{:15}|{:5}|{}".format("단어", "실제값", "예측값"))
print(35 * "-")
for w, t, pred in zip(X_test[i], true, y_predicted[0]):
    if w != 0: # PAD값은 제외함.
        print("{:17}: {:7} {}".format(index_to_word[w], index_to_tag[t].upper(), index_to_tag[pred].upper()))
25/25 [==============================] - 7s 299ms/step - loss: 0.0721 - accuracy: 0.8940
 테스트 정확도: 0.8940
단어           |실제값  |예측값
-----------------------------------
in               : IN      IN
addition         : NN      NN
,                : ,       ,
buick            : NNP     NNP
is               : VBZ     VBZ
a                : DT      DT
relatively       : RB      RB
respected        : VBN     VBN
nameplate        : NN      NN
among            : IN      IN
american         : NNP     NNP
express          : NNP     NNP
card             : NN      NN
holders          : NNS     NNS
,                : ,       ,
says             : VBZ     VBZ
0                : -NONE-  -NONE-
*t*-1            : -NONE-  -NONE-
an               : DT      DT
american         : NNP     NNP
express          : NNP     NNP
spokeswoman      : NN      NN
.                : .       .
서브워드 토크나이저 (Subword Tokenizer)¶
바이트 페어 인코딩 (Byte Pair Encoding, BPE)¶
- 기계가 모르는 단어가 등장하면, 단어 집합에 없는 단어라는 의미에서 OOV(Out-Of-Vocabulary)라고 표현함.
- OOV 문제를 완화하기 위해 서브워드 분리(subword segmentation) 작업을 사용함.
- 하나의 단어는 여러 서브워드들(birthplace = birth + place)의 조합으로 구성된 경우가 많음.
- 하나의 단어를 여러 서브워드로 분리해서 인코딩 및 임베딩하겠다는 의도.
BPE (Byte Pair Encoding)¶
- 1994년에 제안된 데이터 압축 알고리즘.
- 이후 기계 번역 등에서 서브워드 분리 알고리즘으로 응용됨.
- GPT는 BPE를, BERT는 BPE의 변형인 WordPiece를 토크나이저로 사용함.
aaabdaaabac
-> Z=aa
ZabdZabac
-> Y=ab
ZYdZYac
-> X=ZY
XdXac
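위 압축 과정은 '가장 많이 등장하는 글자 쌍을 새 기호로 치환'하는 단계의 반복이므로, 아래처럼 몇 줄로 확인해 볼 수 있다. (`apply_merges`는 설명용으로 붙인 이름)

```python
def apply_merges(s, merges):
    # merges: (치환할 쌍, 새 기호) 목록을 순서대로 적용하는 설명용 함수
    for pair, symbol in merges:
        s = s.replace(pair, symbol)
    return s

print(apply_merges("aaabdaaabac", [("aa", "Z")]))                            # ZabdZabac
print(apply_merges("aaabdaaabac", [("aa", "Z"), ("ab", "Y")]))               # ZYdZYac
print(apply_merges("aaabdaaabac", [("aa", "Z"), ("ab", "Y"), ("ZY", "X")]))  # XdXac
```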
# dictionary
# 훈련 데이터에 있는 단어와 등장 빈도수
low : 5, lower : 2, newest : 6, widest : 3
위와 같은 단어:빈도수 구성을 임의로 딕셔너리라고 했을 때,
이 딕셔너리의 단어집합은 아래와 같이 구성됨.
# vocabulary
low, lower, newest, widest
여기에서 lowest라는 단어가 등장하면 OOV 문제가 발생하게 됨.
BPE를 적용하게 되면,
# dictionary
l o w : 5, l o w e r : 2, n e w e s t : 6, w i d e s t : 3
# vocabulary
l, o, w, e, r, n, s, t, i, d
- 빈도수가 9로 가장 높은 (e, s)의 쌍을 es로 통합.
# dictionary update!
l o w : 5,
l o w e r : 2,
n e w es t : 6,
w i d es t : 3
# vocabulary update!
l, o, w, e, r, n, s, t, i, d, es
- 빈도수가 9로 가장 높은 (es, t)의 쌍을 est로 통합.
# dictionary update!
l o w : 5,
l o w e r : 2,
n e w est : 6,
w i d est : 3
# vocabulary update!
l, o, w, e, r, n, s, t, i, d, es, est
- 빈도수가 7로 가장 높은 (l, o)의 쌍을 lo로 통합.
# dictionary update!
lo w : 5,
lo w e r : 2,
n e w est : 6,
w i d est : 3
# vocabulary update!
l, o, w, e, r, n, s, t, i, d, es, est, lo
...
...
위와 같은 방식으로 10회 반복 후
# dictionary update!
low : 5,
low e r : 2,
newest : 6,
widest : 3
# vocabulary update!
l, o, w, e, r, n, s, t, i, d, es, est, lo, low, ne, new, newest, wi, wid, widest
이 경우, lowest란 단어는 OOV가 되지 않음. (low와 est가 단어집합에 있기 때문)
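참고로, 학습이 끝난 단어집합만 가지고 lowest가 어떻게 분리되는지는 탐욕적 최장일치로도 간단히 흉내 낼 수 있다. (`segment`는 설명용 함수이며, 실제 BPE 인코딩은 병합이 학습된 순서를 따라 적용함)

```python
def segment(word, vocab):
    # 단어집합(vocab)의 서브워드들로 탐욕적 최장일치 분리 -- 설명용 스케치
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # 가장 긴 부분 문자열부터 시도
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])          # 단어집합에 없는 글자는 글자 단위로 둠
            i += 1
    return tokens

vocab = {'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd',
         'es', 'est', 'lo', 'low', 'ne', 'new', 'newest', 'wi', 'wid', 'widest'}
print(segment('lowest', vocab))  # ['low', 'est']
```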
import re, collections
from IPython.display import display, Markdown, Latex
# BPE 실행 횟수
num_merges = 10
dictionary = {'l o w </w>': 5,
              'l o w e r </w>': 2,
              'n e w e s t </w>': 6,
              'w i d e s t </w>': 3}
def get_stats(dictionary):
    # 유니그램 pair들의 빈도수를 카운트
    pairs = collections.defaultdict(int)
    for word, freq in dictionary.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    print('현재 pair들의 빈도수 :', dict(pairs))
    return pairs
def merge_dictionary(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out
bpe_codes = {}
bpe_codes_reverse = {}
for i in range(num_merges):
    display(Markdown("### Iteration {}".format(i + 1)))
    pairs = get_stats(dictionary)
    best = max(pairs, key=pairs.get)
    dictionary = merge_dictionary(best, dictionary)
    bpe_codes[best] = i
    bpe_codes_reverse[best[0] + best[1]] = best
    print("new merge: {}".format(best))
    print("dictionary: {}".format(dictionary))
Iteration 1¶
현재 pair들의 빈도수 : {('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 8, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('e', 's'): 9, ('s', 't'): 9, ('t', '</w>'): 9, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'e'): 3}
new merge: ('e', 's')
dictionary: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}
Iteration 2¶
현재 pair들의 빈도수 : {('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('w', 'es'): 6, ('es', 't'): 9, ('t', '</w>'): 9, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'es'): 3}
new merge: ('es', 't')
dictionary: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}
Iteration 3¶
현재 pair들의 빈도수 : {('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('w', 'est'): 6, ('est', '</w>'): 9, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'est'): 3}
new merge: ('est', '</w>')
dictionary: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
Iteration 4¶
현재 pair들의 빈도수 : {('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('w', 'est</w>'): 6, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'est</w>'): 3}
new merge: ('l', 'o')
dictionary: {'lo w </w>': 5, 'lo w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
Iteration 5¶
현재 pair들의 빈도수 : {('lo', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('w', 'est</w>'): 6, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'est</w>'): 3}
new merge: ('lo', 'w')
dictionary: {'low </w>': 5, 'low e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
Iteration 6¶
현재 pair들의 빈도수 : {('low', '</w>'): 5, ('low', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('w', 'est</w>'): 6, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'est</w>'): 3}
new merge: ('n', 'e')
dictionary: {'low </w>': 5, 'low e r </w>': 2, 'ne w est</w>': 6, 'w i d est</w>': 3}
Iteration 7¶
현재 pair들의 빈도수 : {('low', '</w>'): 5, ('low', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('ne', 'w'): 6, ('w', 'est</w>'): 6, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'est</w>'): 3}
new merge: ('ne', 'w')
dictionary: {'low </w>': 5, 'low e r </w>': 2, 'new est</w>': 6, 'w i d est</w>': 3}
Iteration 8¶
현재 pair들의 빈도수 : {('low', '</w>'): 5, ('low', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('new', 'est</w>'): 6, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'est</w>'): 3}
new merge: ('new', 'est</w>')
dictionary: {'low </w>': 5, 'low e r </w>': 2, 'newest</w>': 6, 'w i d est</w>': 3}
Iteration 9¶
현재 pair들의 빈도수 : {('low', '</w>'): 5, ('low', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'est</w>'): 3}
new merge: ('low', '</w>')
dictionary: {'low</w>': 5, 'low e r </w>': 2, 'newest</w>': 6, 'w i d est</w>': 3}
Iteration 10¶
현재 pair들의 빈도수 : {('low', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'est</w>'): 3}
new merge: ('w', 'i')
dictionary: {'low</w>': 5, 'low e r </w>': 2, 'newest</w>': 6, 'wi d est</w>': 3}
print(bpe_codes)
{('e', 's'): 0, ('es', 't'): 1, ('est', '</w>'): 2, ('l', 'o'): 3, ('lo', 'w'): 4, ('n', 'e'): 5, ('ne', 'w'): 6, ('new', 'est</w>'): 7, ('low', '</w>'): 8, ('w', 'i'): 9}
def get_pairs(word):
    """Return set of symbol pairs in a word.
    Word is represented as a tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs
def encode(orig):
    """Encode word based on list of BPE merge operations, which are applied consecutively"""
    word = tuple(orig) + ('</w>',)
    display(Markdown("__word split into characters:__ <tt>{}</tt>".format(word)))
    pairs = get_pairs(word)
    if not pairs:
        return orig
    iteration = 0
    while True:
        iteration += 1
        display(Markdown("__Iteration {}:__".format(iteration)))
        print("bigrams in the word: {}".format(pairs))
        bigram = min(pairs, key=lambda pair: bpe_codes.get(pair, float('inf')))
        print("candidate for merging: {}".format(bigram))
        if bigram not in bpe_codes:
            display(Markdown("__Candidate not in BPE merges, algorithm stops.__"))
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
                new_word.extend(word[i:j])
                i = j
            except ValueError:
                new_word.extend(word[i:])
                break
            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word = tuple(new_word)
        word = new_word
        print("word after merging: {}".format(word))
        if len(word) == 1:
            break
        else:
            pairs = get_pairs(word)
    # 특별 토큰인 </w>는 출력하지 않는다.
    if word[-1] == '</w>':
        word = word[:-1]
    elif word[-1].endswith('</w>'):
        word = word[:-1] + (word[-1].replace('</w>', ''),)
    return word
encode("loki")
word split into characters: ('l', 'o', 'k', 'i', '</w>')
Iteration 1:
bigrams in the word: {('l', 'o'), ('k', 'i'), ('i', '</w>'), ('o', 'k')}
candidate for merging: ('l', 'o')
word after merging: ('lo', 'k', 'i', '</w>')
Iteration 2:
bigrams in the word: {('lo', 'k'), ('k', 'i'), ('i', '</w>')}
candidate for merging: ('lo', 'k')
Candidate not in BPE merges, algorithm stops.
('lo', 'k', 'i')
encode("lowest")
word split into characters: ('l', 'o', 'w', 'e', 's', 't', '</w>')
Iteration 1:
bigrams in the word: {('s', 't'), ('t', '</w>'), ('o', 'w'), ('w', 'e'), ('l', 'o'), ('e', 's')}
candidate for merging: ('e', 's')
word after merging: ('l', 'o', 'w', 'es', 't', '</w>')
Iteration 2:
bigrams in the word: {('t', '</w>'), ('o', 'w'), ('l', 'o'), ('es', 't'), ('w', 'es')}
candidate for merging: ('es', 't')
word after merging: ('l', 'o', 'w', 'est', '</w>')
Iteration 3:
bigrams in the word: {('l', 'o'), ('est', '</w>'), ('o', 'w'), ('w', 'est')}
candidate for merging: ('est', '</w>')
word after merging: ('l', 'o', 'w', 'est</w>')
Iteration 4:
bigrams in the word: {('w', 'est</w>'), ('l', 'o'), ('o', 'w')}
candidate for merging: ('l', 'o')
word after merging: ('lo', 'w', 'est</w>')
Iteration 5:
bigrams in the word: {('lo', 'w'), ('w', 'est</w>')}
candidate for merging: ('lo', 'w')
word after merging: ('low', 'est</w>')
Iteration 6:
bigrams in the word: {('low', 'est</w>')}
candidate for merging: ('low', 'est</w>')
Candidate not in BPE merges, algorithm stops.
('low', 'est')
encode("lowing")
word split into characters: ('l', 'o', 'w', 'i', 'n', 'g', '</w>')
Iteration 1:
bigrams in the word: {('n', 'g'), ('w', 'i'), ('i', 'n'), ('o', 'w'), ('l', 'o'), ('g', '</w>')}
candidate for merging: ('l', 'o')
word after merging: ('lo', 'w', 'i', 'n', 'g', '</w>')
Iteration 2:
bigrams in the word: {('n', 'g'), ('w', 'i'), ('i', 'n'), ('lo', 'w'), ('g', '</w>')}
candidate for merging: ('lo', 'w')
word after merging: ('low', 'i', 'n', 'g', '</w>')
Iteration 3:
bigrams in the word: {('n', 'g'), ('i', 'n'), ('g', '</w>'), ('low', 'i')}
candidate for merging: ('n', 'g')
Candidate not in BPE merges, algorithm stops.
('low', 'i', 'n', 'g')
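encode()의 역변환(디코딩)은 서브워드들을 그대로 이어 붙이고 특별 토큰 </w>만 제거하면 된다. (`decode`는 설명용으로 붙인 이름)

```python
def decode(tokens):
    # 서브워드 튜플을 이어 붙여 원래 단어 문자열로 복원 -- 설명용 스케치
    return ''.join(tokens).replace('</w>', '')

print(decode(('low', 'est')))    # lowest
print(decode(('lo', 'k', 'i')))  # loki
```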
RNN을 이용한 인코더, 디코더¶
import pandas as pd
import urllib3
import zipfile
import shutil
import os
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
http = urllib3.PoolManager()
url ='http://www.manythings.org/anki/fra-eng.zip'
filename = 'fra-eng.zip'
path = os.getcwd()
zipfilename = os.path.join(path, filename)
with http.request('GET', url, preload_content=False) as r, open(zipfilename, 'wb') as out_file:
shutil.copyfileobj(r, out_file)
with zipfile.ZipFile(zipfilename, 'r') as zip_ref:
zip_ref.extractall(path)
lines = pd.read_csv('fra.txt', names=['src', 'tar', 'lic'], sep='\t') # 세번째 열은 라이선스 정보이므로 뒤에서 제거
len(lines)
178009
lines = lines.loc[:, 'src':'tar']
lines = lines[0:60000] # 6만개만 저장
lines.sample(10)
| src | tar |
|---|---|
| Unlock the door. | Déverrouille la porte. |
| I want to thank you. | Je veux te remercier. |
| They formed a circle. | Ils formèrent un cercle. |
| She became a nurse. | Elle devint infirmière. |
| No one was alive. | Personne n'était vivant. |
| What a pretty girl! | Quelle jolie fille ! |
| I can't stop writing. | Je ne peux pas m’arrêter d'écrire. |
| I love the way you kiss. | J'adore la façon que vous avez d'embrasser. |
| Did you question them? | Les avez-vous remis en question ? |
| You may be needed. | On pourrait avoir besoin de toi. |
lines.tar = lines.tar.apply(lambda x : '\t '+ x + ' \n') # \t: <sos> (시작심볼) \n: <eos> (종료심볼)
lines.sample(10)
| src | tar |
|---|---|
| Explain it to me. | \t Expliquez-le-moi. \n |
| Come to me. | \t Venez à moi. \n |
| We'll wait here. | \t Nous attendrons ici. \n |
| Hey, I want to help you. | \t Hé, je veux vous aider. \n |
| I rang the doorbell. | \t J'ai sonné à la porte. \n |
| Tom wore a straw hat. | \t Tom portait un chapeau de paille. \n |
| It was his best time. | \t Ça a été son meilleur temps. \n |
| Thanks for the tea. | \t Merci pour le thé. \n |
| We can live with that. | \t Nous pouvons vivre avec ça. \n |
| This isn't enough. | \t Ça ne suffit pas. \n |
# 글자 집합 구축
src_vocab = set()
for line in lines.src: # 1줄씩 읽음
    for char in line: # 1개의 글자씩 읽음
        src_vocab.add(char)
tar_vocab = set()
for line in lines.tar:
    for char in line:
        tar_vocab.add(char)
src_vocab_size = len(src_vocab)+1
tar_vocab_size = len(tar_vocab)+1
print(src_vocab_size)
print(tar_vocab_size)
104
74
src_vocab = sorted(list(src_vocab))
tar_vocab = sorted(list(tar_vocab))
print(src_vocab[45:75])
print(tar_vocab[45:75])
['V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']
['Z', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
src_to_index = dict([(word, i + 1) for i, word in enumerate(src_vocab)])
tar_to_index = dict([(word, i + 1) for i, word in enumerate(tar_vocab)])
print(src_to_index)
print(tar_to_index)
{' ': 1, '!': 2, '"': 3, '$': 4, '%': 5, '&': 6, "'": 7, '(': 8, ')': 9, ',': 10, '-': 11, '.': 12, '0': 13, '1': 14, '2': 15, '3': 16, '4': 17, '5': 18, '6': 19, '7': 20, '8': 21, '9': 22, ':': 23, '?': 24, 'A': 25, 'B': 26, 'C': 27, 'D': 28, 'E': 29, 'F': 30, 'G': 31, 'H': 32, 'I': 33, 'J': 34, 'K': 35, 'L': 36, 'M': 37, 'N': 38, 'O': 39, 'P': 40, 'Q': 41, 'R': 42, 'S': 43, 'T': 44, 'U': 45, 'V': 46, 'W': 47, 'X': 48, 'Y': 49, 'Z': 50, 'a': 51, 'b': 52, 'c': 53, 'd': 54, 'e': 55, 'f': 56, 'g': 57, 'h': 58, 'i': 59, 'j': 60, 'k': 61, 'l': 62, 'm': 63, 'n': 64, 'o': 65, 'p': 66, 'q': 67, 'r': 68, 's': 69, 't': 70, 'u': 71, 'v': 72, 'w': 73, 'x': 74, 'y': 75, 'z': 76, '\xa0': 77, '«': 78, '»': 79, 'À': 80, 'Ç': 81, 'É': 82, 'Ê': 83, 'Ô': 84, 'à': 85, 'â': 86, 'ç': 87, 'è': 88, 'é': 89, 'ê': 90, 'ë': 91, 'î': 92, 'ï': 93, 'ô': 94, 'ù': 95, 'û': 96, 'œ': 97, 'С': 98, '\u2009': 99, '\u200b': 100, '‘': 101, '’': 102, '\u202f': 103}
{'\t': 1, '\n': 2, ' ': 3, '#': 4, '&': 5, '(': 6, ')': 7, '-': 8, '.': 9, '0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': 19, ':': 20, 'A': 21, 'B': 22, 'C': 23, 'D': 24, 'E': 25, 'F': 26, 'G': 27, 'H': 28, 'I': 29, 'J': 30, 'K': 31, 'L': 32, 'M': 33, 'N': 34, 'O': 35, 'P': 36, 'Q': 37, 'R': 38, 'S': 39, 'T': 40, 'U': 41, 'V': 42, 'W': 43, 'X': 44, 'Y': 45, 'Z': 46, '_': 47, 'a': 48, 'b': 49, 'c': 50, 'd': 51, 'e': 52, 'f': 53, 'g': 54, 'h': 55, 'i': 56, 'j': 57, 'k': 58, 'l': 59, 'm': 60, 'n': 61, 'o': 62, 'p': 63, 'q': 64, 'r': 65, 's': 66, 't': 67, 'u': 68, 'v': 69, 'w': 70, 'x': 71, 'y': 72, 'z': 73}
encoder_input = []
for line in lines.src: # 입력 데이터에서 1줄씩 문장을 읽음
    temp_X = []
    for w in line: # 각 줄에서 1개씩 글자를 읽음
        temp_X.append(src_to_index[w]) # 글자를 해당되는 정수로 변환
    encoder_input.append(temp_X)
print(encoder_input[:5])
[[46, 51, 1, 2], [43, 51, 62, 71, 70, 1, 2], [43, 51, 62, 71, 70, 12], [27, 65, 71, 68, 69, 103, 2], [27, 65, 71, 68, 55, 76, 103, 2]]
decoder_input = []
for line in lines.tar:
    temp_X = []
    for w in line:
        temp_X.append(tar_to_index[w])
    decoder_input.append(temp_X)
print(decoder_input[:5])
# <sos> 제거
decoder_target = []
for line in lines.tar:
    t = 0
    temp_X = []
    for w in line:
        if t > 0:
            temp_X.append(tar_to_index[w])
        t = t + 1
    decoder_target.append(temp_X)
print(decoder_target[:5])
[[1, 3, 1, 3, 23, 23, 8, 22, 45, 3, 12, 9, 10, 3, 6, 26, 65, 48, 61, 50, 52, 7, 3, 21, 67, 67, 65, 56, 49, 68, 67, 56, 62, 61, 20, 3, 67, 48, 67, 62, 52, 49, 48, 9, 62, 65, 54, 3, 4, 12, 18, 17, 17, 12, 17, 12, 3, 6, 23, 33, 7, 3, 5, 3, 4, 11, 11, 15, 18, 12, 15, 10, 3, 6, 43, 56, 67, 67, 72, 51, 52, 69, 7, 3, 2, 3, 2], [1, 3, 1, 3, 23, 23, 8, 22, 45, 3, 12, 9, 10, 3, 6, 26, 65, 48, 61, 50, 52, 7, 3, 21, 67, 67, 65, 56, 49, 68, 67, 56, 62, 61, 20, 3, 67, 48, 67, 62, 52, 49, 48, 9, 62, 65, 54, 3, 4, 15, 13, 18, 11, 12, 13, 3, 6, 23, 33, 7, 3, 5, 3, 4, 15, 10, 19, 18, 11, 19, 3, 6, 21, 56, 57, 56, 7, 3, 2, 3, 2], [1, 3, 1, 3, 23, 23, 8, 22, 45, 3, 12, 9, 10, 3, 6, 26, 65, 48, 61, 50, 52, 7, 3, 21, 67, 67, 65, 56, 49, 68, 67, 56, 62, 61, 20, 3, 67, 48, 67, 62, 52, 49, 48, 9, 62, 65, 54, 3, 4, 15, 13, 18, 11, 12, 13, 3, 6, 23, 33, 7, 3, 5, 3, 4, 14, 13, 12, 10, 14, 16, 12, 3, 6, 54, 56, 59, 59, 68, 71, 7, 3, 2, 3, 2], [1, 3, 1, 3, 23, 23, 8, 22, 45, 3, 12, 9, 10, 3, 6, 26, 65, 48, 61, 50, 52, 7, 3, 21, 67, 67, 65, 56, 49, 68, 67, 56, 62, 61, 20, 3, 67, 48, 67, 62, 52, 49, 48, 9, 62, 65, 54, 3, 4, 19, 10, 16, 13, 12, 18, 3, 6, 63, 48, 63, 48, 49, 52, 48, 65, 7, 3, 5, 3, 4, 19, 10, 16, 13, 13, 11, 3, 6, 66, 48, 50, 65, 52, 51, 50, 52, 59, 67, 56, 50, 7, 3, 2, 3, 2], [1, 3, 1, 3, 23, 23, 8, 22, 45, 3, 12, 9, 10, 3, 6, 26, 65, 48, 61, 50, 52, 7, 3, 21, 67, 67, 65, 56, 49, 68, 67, 56, 62, 61, 20, 3, 67, 48, 67, 62, 52, 49, 48, 9, 62, 65, 54, 3, 4, 19, 10, 16, 13, 12, 18, 3, 6, 63, 48, 63, 48, 49, 52, 48, 65, 7, 3, 5, 3, 4, 19, 10, 16, 13, 13, 12, 3, 6, 66, 48, 50, 65, 52, 51, 50, 52, 59, 67, 56, 50, 7, 3, 2, 3, 2]] [[3, 1, 3, 23, 23, 8, 22, 45, 3, 12, 9, 10, 3, 6, 26, 65, 48, 61, 50, 52, 7, 3, 21, 67, 67, 65, 56, 49, 68, 67, 56, 62, 61, 20, 3, 67, 48, 67, 62, 52, 49, 48, 9, 62, 65, 54, 3, 4, 12, 18, 17, 17, 12, 17, 12, 3, 6, 23, 33, 7, 3, 5, 3, 4, 11, 11, 15, 18, 12, 15, 10, 3, 6, 43, 56, 67, 67, 72, 51, 52, 69, 7, 3, 2, 3, 2], [3, 1, 3, 23, 23, 8, 22, 45, 3, 12, 9, 10, 3, 
6, 26, 65, 48, 61, 50, 52, 7, 3, 21, 67, 67, 65, 56, 49, 68, 67, 56, 62, 61, 20, 3, 67, 48, 67, 62, 52, 49, 48, 9, 62, 65, 54, 3, 4, 15, 13, 18, 11, 12, 13, 3, 6, 23, 33, 7, 3, 5, 3, 4, 15, 10, 19, 18, 11, 19, 3, 6, 21, 56, 57, 56, 7, 3, 2, 3, 2], [3, 1, 3, 23, 23, 8, 22, 45, 3, 12, 9, 10, 3, 6, 26, 65, 48, 61, 50, 52, 7, 3, 21, 67, 67, 65, 56, 49, 68, 67, 56, 62, 61, 20, 3, 67, 48, 67, 62, 52, 49, 48, 9, 62, 65, 54, 3, 4, 15, 13, 18, 11, 12, 13, 3, 6, 23, 33, 7, 3, 5, 3, 4, 14, 13, 12, 10, 14, 16, 12, 3, 6, 54, 56, 59, 59, 68, 71, 7, 3, 2, 3, 2], [3, 1, 3, 23, 23, 8, 22, 45, 3, 12, 9, 10, 3, 6, 26, 65, 48, 61, 50, 52, 7, 3, 21, 67, 67, 65, 56, 49, 68, 67, 56, 62, 61, 20, 3, 67, 48, 67, 62, 52, 49, 48, 9, 62, 65, 54, 3, 4, 19, 10, 16, 13, 12, 18, 3, 6, 63, 48, 63, 48, 49, 52, 48, 65, 7, 3, 5, 3, 4, 19, 10, 16, 13, 13, 11, 3, 6, 66, 48, 50, 65, 52, 51, 50, 52, 59, 67, 56, 50, 7, 3, 2, 3, 2], [3, 1, 3, 23, 23, 8, 22, 45, 3, 12, 9, 10, 3, 6, 26, 65, 48, 61, 50, 52, 7, 3, 21, 67, 67, 65, 56, 49, 68, 67, 56, 62, 61, 20, 3, 67, 48, 67, 62, 52, 49, 48, 9, 62, 65, 54, 3, 4, 19, 10, 16, 13, 12, 18, 3, 6, 63, 48, 63, 48, 49, 52, 48, 65, 7, 3, 5, 3, 4, 19, 10, 16, 13, 13, 12, 3, 6, 66, 48, 50, 65, 52, 51, 50, 52, 59, 67, 56, 50, 7, 3, 2, 3, 2]]
max_src_len = max([len(line) for line in lines.src])
max_tar_len = max([len(line) for line in lines.tar])
print(max_src_len)
print(max_tar_len)
72
110
from tensorflow.keras.utils import to_categorical
encoder_input = pad_sequences(encoder_input, maxlen=max_src_len, padding='post')
decoder_input = pad_sequences(decoder_input, maxlen=max_tar_len, padding='post')
decoder_target = pad_sequences(decoder_target, maxlen=max_tar_len, padding='post')
encoder_input = to_categorical(encoder_input)
decoder_input = to_categorical(decoder_input)
decoder_target = to_categorical(decoder_target)
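The padding and one-hot steps above can be sketched in plain NumPy (a minimal illustration of what `pad_sequences(..., padding='post')` and `to_categorical` produce, not Keras itself; the sequences and vocabulary size below are made up):

```python
import numpy as np

def pad_post(seqs, maxlen):
    # Zero-pad each sequence at the end, like pad_sequences(..., padding='post')
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for i, s in enumerate(seqs):
        out[i, :len(s)] = s[:maxlen]
    return out

def one_hot(x, num_classes):
    # Turn integer indices into one-hot vectors, like to_categorical
    return np.eye(num_classes)[x]

seqs = [[1, 4, 2], [1, 3, 3, 2]]
padded = pad_post(seqs, maxlen=5)
print(padded.shape)   # (2, 5)
encoded = one_hot(padded, num_classes=6)
print(encoded.shape)  # (2, 5, 6)
```

After these two steps each sentence becomes a (timesteps, vocab) matrix, which is why the encoder and decoder `Input` layers below take a `(None, vocab_size)` shape.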
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model
import numpy as np
encoder_inputs = Input(shape=(None, src_vocab_size))
encoder_lstm = LSTM(units=256, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
# encoder_outputs is also returned, but it is not needed here, so it is discarded.
encoder_states = [state_h, state_c]
# Unlike a vanilla RNN, an LSTM has two states: the hidden state and the cell state.
decoder_inputs = Input(shape=(None, tar_vocab_size))
decoder_lstm = LSTM(units=256, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
# The decoder's initial state is set to the encoder's hidden state and cell state.
decoder_softmax_layer = Dense(tar_vocab_size, activation='softmax')
decoder_outputs = decoder_softmax_layer(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
model.fit(x=[encoder_input, decoder_input], y=decoder_target, batch_size=64, epochs=50, validation_split=0.2)
encoder_model = Model(inputs=encoder_inputs, outputs=encoder_states)
# Tensors that hold the states from the previous time step
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
# To predict the next word, the previous time step's states are used as the initial state; this is implemented in decode_sequence() below.
decoder_states = [state_h, state_c]
# Unlike during training, the hidden state and cell state returned by the LSTM (state_h and state_c) are not discarded.
decoder_outputs = decoder_softmax_layer(decoder_outputs)
decoder_model = Model(inputs=[decoder_inputs] + decoder_states_inputs, outputs=[decoder_outputs] + decoder_states)
index_to_src = dict((i, char) for char, i in src_to_index.items())
index_to_tar = dict((i, char) for char, i in tar_to_index.items())
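The dictionary inversion above can be checked with a tiny stand-in vocabulary (the mapping below is hypothetical, for illustration only):

```python
# Hypothetical miniature target vocabulary, for illustration only
tar_to_index_demo = {'\t': 1, '\n': 2, 'h': 3, 'i': 4}
# Invert it the same way index_to_tar is built above
index_to_tar_demo = dict((i, char) for char, i in tar_to_index_demo.items())
print(index_to_tar_demo[3])  # h
```

This inverse lookup is what lets the decoding loop turn a predicted index back into a character.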
def decode_sequence(input_seq):
    # Get the encoder states from the input
    states_value = encoder_model.predict(input_seq)
    # Create the one-hot vector corresponding to <SOS>
    target_seq = np.zeros((1, 1, tar_vocab_size))
    target_seq[0, 0, tar_to_index['\t']] = 1.
    stop_condition = False
    decoded_sentence = ""
    # Loop until stop_condition becomes True
    while not stop_condition:
        # Use the previous time step's states (states_value) as this step's initial state
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        # Convert the prediction into a character
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = index_to_tar[sampled_token_index]
        # Append the predicted character to the decoded sentence
        decoded_sentence += sampled_char
        # Stop when <eos> is reached or the maximum length is exceeded
        if (sampled_char == '\n' or
                len(decoded_sentence) > max_tar_len):
            stop_condition = True
        # Store the current prediction as the next time step's input
        target_seq = np.zeros((1, 1, tar_vocab_size))
        target_seq[0, 0, sampled_token_index] = 1.
        # Store the current states as the next time step's states
        states_value = [h, c]
    return decoded_sentence
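The control flow of this decoding loop can be exercised without TensorFlow by substituting a fake model (the `fake_predict` function, vocabulary, and probabilities below are all made up for illustration):

```python
import numpy as np

index_to_char = {0: '\t', 1: 'h', 2: 'i', 3: '\n'}
vocab_size = len(index_to_char)

def fake_predict(prev_index):
    # Stand-in for decoder_model.predict: deterministically emits h, i, then \n
    next_index = {0: 1, 1: 2, 2: 3}.get(prev_index, 3)
    probs = np.zeros(vocab_size)
    probs[next_index] = 1.0
    return probs

decoded, prev, max_len = "", 0, 10
while True:
    probs = fake_predict(prev)
    # Greedy sampling: pick the most probable character, as in decode_sequence
    sampled = int(np.argmax(probs))
    ch = index_to_char[sampled]
    decoded += ch
    # Same stop condition: end-of-sequence character or maximum length
    if ch == '\n' or len(decoded) > max_len:
        break
    prev = sampled
print(repr(decoded))  # 'hi\n'
```

The real function differs only in that it also carries the LSTM states (h, c) forward between iterations instead of a bare index.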
for seq_index in [3, 50, 100, 300, 1001]:  # indices of the input sentences
    input_seq = encoder_input[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print(35 * "-")
    print('Input sentence:', lines.src[seq_index])
    print('Target sentence:', lines.tar[seq_index][1:len(lines.tar[seq_index])-1])  # printed without the leading '\t' and trailing '\n'
    print('Translated sentence:', decoded_sentence[:len(decoded_sentence)-1])  # printed without the trailing '\n'